Conversation
NucleusImage - text kv caching
    gate1 = gate1.clamp(min=-2.0, max=2.0)
    gate2 = gate2.clamp(min=-2.0, max=2.0)
It seems weird to me that we first clamp the gates to [-2.0, 2.0] and then essentially clamp them again by squashing with the tanh function below. Is this intended?
I agree it's weird. :) I used it to stabilize the gradients when the tanh gates saturate during training. I will evaluate the model's performance without it and get back to you!
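For context, a minimal sketch of the clamp-then-tanh pattern under discussion (the `tanh_gate` helper name is illustrative, not from the PR):

```python
import torch

def tanh_gate(gate_logits: torch.Tensor) -> torch.Tensor:
    # tanh alone already bounds the output to (-1, 1), so the clamp is
    # redundant for the forward value. Its effect is on training: clamp has
    # zero gradient outside [-2, 2], which stops the logits from drifting
    # further into tanh's saturated region, where gradients vanish.
    gate_logits = gate_logits.clamp(min=-2.0, max=2.0)
    return torch.tanh(gate_logits)
```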
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
    self.experts = nn.ModuleList(
        [
            FeedForward(
                dim=hidden_size,
                dim_out=hidden_size,
                inner_dim=moe_intermediate_dim,
                activation_fn="swiglu",
                bias=False,
            )
            for _ in range(num_experts)
        ]
    )
You would need the projections to be in packed/contiguous format, (num_experts, dim_in, dim_out), for torch.grouped_mm support. @sayakpaul, is that possible? In Transformers we use the inline weight converter.

Not at the moment, because MoEs are still a bit of a special case in this part of the world.

I can pack the MoE weights. That's how I originally trained the model with Expert Parallel.
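For reference, a minimal sketch of what the packed layout enables, assuming equal tokens per expert and using `torch.bmm` as a stand-in for a grouped matmul (real grouped-GEMM kernels also handle a variable number of tokens per expert; all shapes and names below are illustrative):

```python
import torch

num_experts, dim_in, dim_out, tokens_per_expert = 4, 64, 128, 16

# Packed/contiguous expert projections: a single (num_experts, dim_in, dim_out)
# tensor instead of a ModuleList of per-expert Linear layers.
packed_w = torch.randn(num_experts, dim_in, dim_out)

# Tokens already routed and grouped by expert: (num_experts, tokens, dim_in).
x = torch.randn(num_experts, tokens_per_expert, dim_in)

# One batched matmul replaces a Python loop over expert modules.
y = torch.bmm(x, packed_w)  # (num_experts, tokens_per_expert, dim_out)
```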
…mage.py Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
    if max_txt_seq_len is None:
        raise ValueError("`max_txt_seq_len` must be provided.")

Would it be possible to provide a reasonable default value for `max_txt_seq_len` instead of raising an error?
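One possible shape for that, as a sketch; the constant and its value are assumptions, not from the PR:

```python
# Hypothetical fallback; 1024 is an illustrative value, not the model's
# actual text-sequence limit.
DEFAULT_MAX_TXT_SEQ_LEN = 1024

if max_txt_seq_len is None:
    max_txt_seq_len = DEFAULT_MAX_TXT_SEQ_LEN
```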
    class TestNucleusMoEImageTransformer(NucleusMoEImageTransformerTesterConfig, ModelTesterMixin):
        def test_txt_seq_lens_deprecation(self):

I think we can remove this test now that `txt_seq_lens` has been removed from the transformer's forward method.
What does this PR do?
This PR introduces the NucleusMoE-Image series into the diffusers library.
NucleusMoE-Image is a 17B-parameter model with 2B active parameters, trained with efficiency at its core. Our novel architecture highlights the scalability of sparse MoE architectures for image generation. The technical report will be released very soon.